ABSTRACT

The combination of high-density transposon-
mediated mutagenesis and high-throughput
sequencing has led to significant advancements in 
research on essential genes, resulting in a dramatic 
increase in the number of identified prokaryotic es- 
sential genes under diverse conditions and a revised 
essential-gene concept that includes all essential 
genomic elements, rather than focusing on 
protein-coding genes only. DEG 10, a new release 
of the Database of Essential Genes (available at 
http://www.essentialgene.org), has been developed 
to accommodate these quantitative and qualitative 
advancements. In addition to increasing the 
number of bacterial and archaeal essential genes 
determined by genome-wide gene essentiality 
screens, DEG 10 also harbors essential noncoding 
RNAs, promoters, regulatory sequences and repli- 
cation origins. These essential genomic elements 
are determined not only in vitro, but also in vivo, 
under diverse conditions including those for 
survival, pathogenesis and antibiotic resistance. 
We have developed customizable BLAST tools that 
allow users to perform species- and experiment-
specific BLAST searches for a single gene, a list
of genes, or annotated or unannotated genomes.
Therefore, DEG 10 includes essential genomic 
elements under different conditions in three 
domains of life, with customizable BLAST tools. 

INTRODUCTION 

Delineating a set of essential genomic elements and 
proteins that make up a living organism helps to under- 
stand critical cellular processes that sustain life (1-3). 
Identification of essential genes is especially useful to
studies of synthetic biology (4), which seeks to make an
artificial self-sustainable living cell, with addable gene 
circuitries that encode desirable traits. Bacterial essential 
genes, because of their lethality phenotype, are attractive
drug targets, and this is especially important for
multidrug-resistant bacteria (5).

Reverse genetics (from gene disruption to phenotypic 
characterization) has been extensively used to experimen- 
tally determine essential genes. One standard method is 
to perform targeted mutagenesis in a particular gene of 
interest. Classical examples include essential gene deter- 
mination in Bacillus subtilis (6) and Escherichia coli (7), 
in which all protein-coding genes are deleted one by one. 
This method gives a clear-cut answer on gene lethality, but 
it is labor-intensive, time-consuming and requires detailed 
genome annotation. Single-gene knockout screens can 
overlook genes causing synthetic lethality, which refers 
to lethal phenotypes caused by genetic interactions of 
genes that are nonessential when deleted separately (3). 
Indeed, duplicated genes are less likely to be essential
than singletons (8). Another method is to construct a
random transposon-insertion library, followed by
determination of insertion sites by DNA hybridization (9) or
microarray (10), which suffers from some shortcomings 
including missing low-abundance transcripts, low reso- 
lution in locating insertion sites, and narrow ranges in 
counting probe density. An advantage of global trans- 
poson mutagenesis is that it can simultaneously identify 
essential noncoding elements in addition to protein-coding 
genes. 

The combination of high-density transposon-mediated 
mutagenesis and high-throughput sequencing has resulted 
in significant advancements in the study of essential genes 
(11). This method, however, has in fact been gradually 
developed for more than 10 years. In 1999, Venter and 
coworkers first performed Sanger sequencing to determine 
transposon insertion sites (12), and later, various versions 
of combining transposon mutagenesis and next-generation 
sequencing were developed, such as TraDIS (13), INSeq 
(14), HITS (15), Tn-seq (16) and Tn-seq Circle (17), here
collectively referred to as Tn-seq. The application of 
Tn-seq has allowed for significant advancements in 
studies on essential genes over the past few years, resulting 
in (i) a dramatic increase in the number of prokaryotic 
species with gene essentiality screens; (ii) a revision of
the essential-gene concept that includes all essential 
genomic elements, such as noncoding RNAs, rather than 
focusing on protein-coding genes only and (iii) gene essentiality
screens in a wide array of experimental conditions
in vitro and in vivo, rather than focusing only on rich
media in cell culture.

We constructed a database of essential genes (DEG) in 
2004 (18), and DEG 5.0 included essential genes of both 
bacteria and eukaryotes (19). In addition to DEG, other 
essential gene databases include EGGS (Essential Genes 
on Genome Scale, http://www.nmpdr.org/FIG/eggs.cgi)
and OGEE (online gene essentiality database) (20), 
where the former hosts microbial gene essentiality data
experimentally obtained from published genome-scale
gene essentiality screens and the latter hosts essential- 
gene data obtained from large-scale experiments with 
associated gene features and text-mining results. Because
it includes text-mining results, OGEE has the most essential-gene
records, whereas DEG entries are human curated and DEG is
the only one supporting BLAST searches. We have constructed
DEG 10 to accommodate the quantitative
and qualitative advancements in identifying essential
genes by genome-wide essentiality screens in recent 
years, and the following is a summary of new database 
developments. 

(i) In addition to protein-coding genes, DEG 10 now 
harbors essential genomic elements, including 
noncoding RNAs, promoters, regulatory sequences 
and replication origins (21,22). 

(ii) The number of bacteria with saturated genome-wide 
gene essentiality screens has nearly tripled,
compared with that in DEG 5 (19). 

(iii) DEG 10 contains essential genomic elements
determined not only in vitro (culture dishes), but 
also in vivo (intact mice) (14), not only for survival 
but also for pathogenesis (23), not only in rich 
media, but also in more diverse conditions, such 
as those required for cholesterol catabolism (24), 
antibiotic resistance (17), bile acid tolerance 
(13,25) and bacteriophage infection (26). 

(iv) DEG 10 hosts archaeal essential genes determined 
from the first gene essentiality screen in an archaeal
genome (27). 

(v) DEG 10 hosts both essential and nonessential 
protein-coding genes. 

(vi) DEG 10 is integrated with customizable BLAST 
tools that allow users to perform species- and 
experiment-specific searches for a single gene, a list
of genes, or annotated or unannotated genomes.

Therefore, DEG 10 (www.essentialgene.org) reflects the 
progress of the research on essential genes by including 
essential genomic elements under different conditions in 
three domains of life, with customizable BLAST tools. 



DATABASE NEW DEVELOPMENTS 

Increased number of bacterial species with genome-wide 
essentiality screens 

The combination of high-throughput sequencing and 
high-density transposon mutagenesis has greatly
accelerated the process of determining essential genes.
Compared to DEG 5 (19), the number of bacteria with 
saturated genome-wide gene essentiality screens has nearly
tripled in DEG 10, which has data for 31 bacteria. DEG 
10 contains more than 12 000 bacterial essential genes, 
more than twice the number of those in DEG 5. The 
figures corresponding to newly added essential genes are 
highlighted in Table 1. 

In addition to essential genes, in fact, nonessential genes 
can be determined as well in most genome-wide essential- 
ity screens. Single-gene knockout experiments directly 
determine whether a particular gene is essential or nones- 
sential. Genome-wide transposon mutagenesis determines 
nonessential genes first, because all recovered mutants 
only harbor transposon insertions in nonessential genes, 
while essential genes are, in fact, inferred. Therefore, 
nonessential genes can be reliably identified by both
kinds of approaches. Because information about nones-
sential genes can be important as well, DEG 10 hosts
nonessential genes, which are organized into a sub- 
database. 
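
The inference described above can be illustrated with a minimal sketch, which is not the procedure of any particular study: genes that carry insertions in the recovered mutants are called nonessential, and genes without insertions are inferred to be essential. The function name, the gene coordinates and the `min_insertions` cutoff are hypothetical, and published screens typically apply statistical models rather than a fixed cutoff.

```python
from bisect import bisect_left, bisect_right


def classify_genes(genes, insertion_sites, min_insertions=1):
    """Label each gene by whether recovered mutants carry insertions in it.

    genes: dict mapping gene name -> (start, end), 1-based inclusive.
    insertion_sites: genomic positions of mapped transposon insertions.
    min_insertions: hypothetical cutoff; genes with fewer insertions
        are inferred to be essential.
    """
    sites = sorted(insertion_sites)
    calls = {}
    for name, (start, end) in genes.items():
        # Count the insertion sites that fall inside [start, end].
        hits = bisect_right(sites, end) - bisect_left(sites, start)
        if hits >= min_insertions:
            calls[name] = "nonessential"
        else:
            calls[name] = "essential (inferred)"
    return calls


# Toy data for illustration only.
genes = {"geneA": (100, 400), "geneB": (500, 900), "geneC": (1000, 1300)}
insertions = [120, 250, 390, 950, 1010, 1299]
calls = classify_genes(genes, insertions)
```

In this toy example only geneB lacks insertions, so it is the only gene inferred to be essential.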

Determination of essential noncoding genomic elements 

It is increasingly being recognized that bacterial genomes 
encode large amounts of noncoding RNAs (28). The use 
of high-density transposon mutagenesis and high-through- 
put sequencing makes identification of essential noncoding 
RNAs possible. In the genome of Caulobacter crescentus, 
428 735 unique Tn5 insertions were generated and mapped 
in its 4 Mb genome. As a result, in addition to identifying
480 essential protein-coding genes, 29 tRNAs and eight
small noncoding RNAs were also found to be essential
(22). In Mycobacterium tuberculosis, 36 488 transposon
insertions were generated and mapped, and in addition 
to essential protein-coding genes, 25 nondisruptable 
genomic segments were found. These segments include 
10 tRNAs and the RNA catalytic unit of RNaseP, 
which is required for tRNA processing (21). In a study 
with a similar method for the Salmonella serovar
Typhimurium, 15 noncoding RNAs were found to be es-
sential (29). It is noteworthy that RNaseP was again
among the identified essential noncoding RNAs, and there-
fore it is likely to be a widely required noncoding RNA
among bacteria. 
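
To illustrate why insertion density matters for noncoding elements, the sketch below reports stretches of a genome that carry no insertion and are at least a chosen length. It is a simplified assumption rather than the method of the cited studies: the function name, the `min_gap` threshold and the toy coordinates are hypothetical, and the genome is treated as linear. With 428 735 insertions in a genome of about 4 Mb, the mean spacing is roughly 9 bp, so an insertion-free segment much longer than that is unlikely to arise by chance.

```python
def insertion_free_segments(insertion_sites, genome_length, min_gap):
    """Return (start, end) intervals of at least min_gap bp without insertions.

    Coordinates are 1-based and inclusive. The genome is treated as
    linear for simplicity, although bacterial chromosomes are usually
    circular.
    """
    segments = []
    previous = 0  # position just before the first base
    # A sentinel past the last base closes the final gap.
    for site in sorted(set(insertion_sites)) + [genome_length + 1]:
        gap_start, gap_end = previous + 1, site - 1
        if gap_end - gap_start + 1 >= min_gap:
            segments.append((gap_start, gap_end))
        previous = site
    return segments


# Mean spacing for the Caulobacter figures quoted above,
# taking 4 Mb as the approximate genome size.
mean_spacing = 4_000_000 / 428_735

# Toy data for illustration only.
sites = [5, 12, 18, 60, 66, 71, 100]
segments = insertion_free_segments(sites, genome_length=100, min_gap=20)
```

Overlaying such segments on the annotation would then show which ones fall in tRNAs, small RNAs, promoters or unannotated intergenic sequence.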

Mann et al. tested the hypothesis that some noncoding 
RNAs have niche-specific roles in virulence (23). Because 
increasing evidence suggests sRNAs are involved in patho- 
genesis, Mann et al. first performed RNA-seq to define the 
sRNA repertoire of S. pneumoniae, a causative agent of
pneumonia, and identified 89 sRNAs. To examine organ- 
specific roles in pneumococcal pathogenesis, they 
generated a pool of pneumococcal mutants by transposon 
mutagenesis, administered the mutants to organs vital
to the progression of pneumococcal diseases, the
nasopharynx, lungs and bloodstream, and performed deep
sequencing on DNA from recovered mutants.
Consequently, 28 sRNAs in the lung, 26 in the
nasopharynx and 18 in the blood were found to alter
fitness in these host niches. Therefore, this study used
Tn-seq to assay the role of sRNA in pathogenesis in a
niche-specific manner (23).

Table 1 footnotes: Figures in bold denote newly added essential genes.
Genetic footprinting is a method that performs transposon mutagenesis followed by PCR to determine transposon insertion sites (79). Tn-seq here
collectively refers to a method that uses next-generation sequencing to determine transposon insertion sites, including TraDIS, INSeq, HITS,
Tn-seq and Tn-seq Circle.

Nucleic Acids Research, 2014, Vol. 42, Database issue D577

In addition to noncoding RNAs, other noncoding 
elements of the genome can be essential as well. These 
include promoters of some essential protein-coding 
genes, regulatory sequences and replication origins.
Indeed, Christen et al. identified 402 essential promoter 
regions and two essential elements in the replication 
origin of the Caulobacter genome, in addition to 91 essen- 
tial intergenic sequences with unknown functions (22). 
Zhang et al. identified 35 intergenic elements for optimal 
growth of M. tuberculosis (21). DEG 10 collects the above 
identified noncoding genomic elements, with annotations 
from the Rfam database (30), if relevant annotations 
are available. Because of the apparent essential role of 
replication origins, DEG 10 also links to DoriC, which 
is a database of bacterial and archaeal replication 
origins (31). 

Determination of essential genes under diverse conditions 

The application of high-throughput sequencing makes it 
possible to determine and quantify contributions of essen- 
tial genes to organism fitness under conditions that are not 
practical by using other methods, because of the digital
nature of next-generation sequencing. Therefore, in
addition to regular rich medium in cell cultures, in the 
past several years, bacterial essential genes have been 
identified under a large number of different conditions, 
e.g. in intact mice. 

One illustrative example is the study on genes required 
to establish a human gut symbiont (14). Goodman et al.
first performed the INSeq method (transposon mutagen-
esis followed by next-generation sequencing) to identify
a set of essential genes for the commensal 
B. thetaiotaomicron in vitro. Next, they examined the 
genes critical for fitness in vivo, i.e. in a mammalian gut
ecosystem by colonizing bacterial mutants in germ-free 
mice. By comparing the input (before inoculation) and 
output (recovered bacteria), 280 genes showed underrepre- 
sentation, suggesting them to be critical for in vivo fitness. 
By changing the experimental conditions such as in the 
presence of human gut-associated bacteria, they identified 
five adjacent genes that conferred fitness disadvantages 
during monoassociation of germ-free mice, while 
showing no impact on bacterial growth in vitro, thus high- 
lighting the importance of the in vivo context in 
determining gene essentiality (14). 
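
The input-versus-output comparison can be sketched as a log2 fold change in each gene's share of sequencing reads. This is a simplified illustration, not the statistical procedure used by Goodman et al.; the function name, the pseudocount, the cutoff of -2.0 and the read counts are all hypothetical.

```python
import math


def log2_fold_changes(input_counts, output_counts, pseudocount=1.0):
    """Compare each gene's share of reads before and after selection."""
    total_in = sum(input_counts.values())
    total_out = sum(output_counts.values())
    changes = {}
    for gene, n_in in input_counts.items():
        n_out = output_counts.get(gene, 0)
        # Normalize to library size; the pseudocount avoids log(0).
        freq_in = (n_in + pseudocount) / total_in
        freq_out = (n_out + pseudocount) / total_out
        changes[gene] = math.log2(freq_out / freq_in)
    return changes


# Toy read counts per gene, for illustration only.
input_counts = {"geneA": 500, "geneB": 480, "geneC": 520}
output_counts = {"geneA": 510, "geneB": 15, "geneC": 475}

changes = log2_fold_changes(input_counts, output_counts)
# Hypothetical cutoff: call a gene underrepresented below -2.0.
underrepresented = [g for g, lfc in changes.items() if lfc < -2.0]
```

Because the counts are digital, the same calculation can be repeated for any condition in which mutants can be recovered and sequenced.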

In a large-scale study, van Opijnen and Camilli per-
formed Tn-seq on the genome of S. pneumoniae under 17
in vitro conditions (e.g. pH, temperature, antibiotic, heavy
metal, stress, nutritional stimulation) and two in vivo con-
ditions (carriage and infection), and identified over
1,800 genotype-phenotype genetic interactions and
associated pathways (32). Other condition-specific
studies include the identification of essential genes for
bile acid tolerance, a trait required of an enteric bacterium
and for carriage of S. Typhi in the gall bladder (13,25), 
resistance to the aminoglycoside antibiotic tobramycin 
in Pseudomonas aeruginosa (17), bacteriophage infection 
of S. Typhi to assess for Vi polysaccharide capsule expres- 
sion (26) and cholesterol metabolism in M. tuberculosis 
(24). In addition to those determined in rich medium 
only, DEG 10 harbors condition-specific essential genes 
as well. 

Determination of essential genes in an archaeal genome 

Archaea are prokaryotes that constitute a separate domain 
of life, in addition to bacteria and eukaryotes (33). Some 
archaea can survive in extreme conditions, such as highly 
salty or hot environments. Methanogenesis, a process to 
generate methane, is a specialized anaerobic respiration 
that requires distinctive biochemical reactions unique to 
methanogenic archaea, which are responsible for 80% of 
the methane in greenhouse gas (34). 

By using the method of Tn-seq, Sarmiento et al.
identified essential genes in the hydrogenotrophic, methano-
genic archaeon Methanococcus maripaludis S2, and this
was the first genome-wide gene essentiality screen in
archaea (27). About 89 000 unique transposon inserts
were mapped, and 526 genes were classified as essential 
in rich medium. Similar to bacteria, many essential genes 
encode fundamental cellular processes, such as transcrip- 
tion, translation and replication. Some essential genes, 
however, are unique to the archaeal or methanococcal 
lineages. For instance, the DNA polymerase PolD is
essential, whereas the archaeal homolog of bacterial 
PolB is not (27). 

Determination of essential eukaryotic genes 

In contrast to prokaryotic essential genes, whose number has
increased dramatically in recent years, the number of eukary-
otic essential genes, while climbing steadily, has not
increased as sharply, apparently owing to the lack of
genome-wide mutagenesis strategies. Generating single-
gene knockouts instead takes much more effort, and
therefore usually requires multi-center collaborations.
The aim of the International Knockout Mouse 
Consortium (IKMC), formed in 2007, is to generate 
mutant mouse lines with all genes deleted one by one 
(35). A recent report showed mouse gene deletion 
mutants have been obtained for ~17 000 of the total
20 000 protein-coding genes (36). Therefore, in the near
future, we expect to have a complete set of essential 
mouse protein-coding genes. With Saccharomyces 
cerevisiae being the first eukaryote to have all of its 
single-gene deletion mutants generated (37), DEG 10 has 
added essential genes of Schizosaccharomyces pombe,
which is the second eukaryote that has a saturated gene 
deletion study (38). 

Customizable BLAST tools 

Performing homologous searches with the BLAST 
program (39) against DEG is common (40-43), and there- 
fore to facilitate this use, we have developed a set of 
customizable BLAST tools. Users have the following four
options. 

(i) To perform BLAST search for a single gene. The 
major improvement for this option is that users now 
can perform species-specific BLAST search, in 
addition to having the option to change E-values.
The output is unprocessed BLAST raw results. 

(ii) To perform BLAST search for a list of genes. Users
can submit a list of protein or DNA sequences, and
the BLAST output will be organized and processed 
to generate an XML file that is parsed by the 
Biopython module (44). The output includes how 
many genes among the queried gene set have 
DEG homologs, and how many homologous genes 
in DEG are found. All homologous genes are clickable,
linking to the corresponding alignments. The
above function can also be done in a species- 
specific manner. 

(iii) To perform BLAST search for annotated genome 
sequences. Because of the increasing pace of 
genome sequencing, in many cases users need to 
analyze whole-genome sequences. By using this 
option, users can submit a whole-genome sequence 
or scaffold, with annotation information, i.e. either 
in the GenBank format or by uploading Protein 
Table Files (PTT format). 

(iv) To perform BLAST search for unannotated genome 
sequences. If users need to analyze whole-genome 
sequences that have not been annotated, DEG is 
integrated with two gene-finding programs, Zcurve 
(45) and Glimmer (46), for gene identification. 
Protein-coding genes are first identified by Zcurve 
or Glimmer, and then BLAST searches are performed
against DEG. For both options (iii) and (iv),
the output is processed and organized to convey
information on the number of homologs in DEG
and in queried genomes, with links to alignments.
The XML files and resulting webpages are stored 
for 7 days on the server, and can be retrieved as 
needed. 

With the aforementioned new tools, users can perform 
BLAST searches for single genes, multiple genes, 
annotated genomes or unannotated genomes with filters 
to restrict the search to a subset of species or experiments 
with desirable E-values.
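
The kind of summary described for a list of genes (how many queries have DEG homologs, and how many distinct DEG records are hit) can be sketched from a BLAST XML report. The sketch below is an assumption for illustration and not the DEG server code: it uses the Python standard library parser in place of Biopython so that it runs without extra dependencies, the embedded report is a hand-written toy in the BLAST XML layout, the hit descriptions are invented, and the `max_evalue` cutoff is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written toy report following the BLAST XML element names.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query_gene_1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>hypothetical DEG record A</Hit_def>
          <Hit_hsps><Hsp><Hsp_evalue>1e-40</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>hypothetical DEG record B</Hit_def>
          <Hit_hsps><Hsp><Hsp_evalue>0.5</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>query_gene_2</Iteration_query-def>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def summarize_hits(xml_text, max_evalue=1e-5):
    """Map each query to the hits whose best HSP passes the cutoff."""
    root = ET.fromstring(xml_text)
    homologs = {}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        kept = []
        for hit in iteration.iter("Hit"):
            evalues = [float(e.text) for e in hit.iter("Hsp_evalue")]
            if evalues and min(evalues) <= max_evalue:
                kept.append(hit.findtext("Hit_def"))
        homologs[query] = kept
    return homologs


homologs = summarize_hits(SAMPLE_XML)
queries_with_homologs = sum(1 for hits in homologs.values() if hits)
distinct_deg_hits = len({h for hits in homologs.values() for h in hits})
```

Restricting the search to a subset of species or experiments would correspond to filtering the database before the search, which this sketch does not attempt to reproduce.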



FUTURE PERSPECTIVE 

Recent breakthroughs in sequencing technology, i.e. the 
next-generation sequencing that parallelizes the process to 
sequence millions of reads concurrently, have fundamentally
changed many areas of biological research, and the
research on essential genes is no exception. Significant ad-
vancements have been made in essential-gene studies; for
example, the concept of the essential gene has been revised
to include all essential genomic elements, rather than 
focusing on protein-coding genes only. It is not difficult 
to envision that in the near future, genome-wide gene essentiality
screens will be performed in a large number of
bacteria and archaea, under increasingly diverse experimental
conditions, and will result in dramatic increases
in identified prokaryotic essential genomic elements. The 
accumulation of essential-gene information will be particularly
helpful in identifying bacterial drug targets (47)
and in constructing the minimal genome in studies of syn- 
thetic biology (4). Without breakthroughs in genome-wide 
mutagenesis technology, however, there will likely be no 
dramatic increases in identified eukaryotic essential genes. 
Nevertheless, it is expected that single-gene knockout 
projects for the model organisms, such as mice and 
Arabidopsis thaliana (48), will soon be completed. It is in-
creasingly being recognized that mammalian genomes
have highly complex transcriptomes (49), and therefore 
we would expect that some eukaryotic noncoding 
elements, such as long noncoding RNAs, will be identified 
as essential. DEG will continue to incorporate newly dis- 
covered essential genomic elements in a timely manner to 
keep pace with this rapidly developing field. 